Abstract

In this paper we provide a new geometric characterization of the Hirschfeld-Gebelein-Rényi maximal correlation of a pair of random variables $(X, Y)$, as well as of the chordal slope of the nontrivial boundary of the hypercontractivity ribbon of $(X, Y)$ at infinity. The new characterizations lead to simple proofs for some of the known facts about these quantities. We also provide a counterexample to a data processing inequality claimed by Erkip and Cover, and find the correct tight constant for this kind of inequality.






I. Introduction 

There are various measures available to quantify the dependence between two random variables. A well-known such measure for real-valued random variables is the Pearson correlation coefficient $\rho(X,Y) := \frac{\mathrm{cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$, which quantifies the linear dependence between the two random variables. A closely related measure, called the Hirschfeld-Gebelein-Rényi maximal correlation, or simply the maximal correlation, measures the cosine of the angle between the linear subspaces of mean zero square integrable real-valued random variables defined by the individual random variables, as below.

Definition 1: Given random variables $X$ and $Y$, the Hirschfeld-Gebelein-Rényi maximal correlation of $(X,Y)$ is defined as follows:

$$\rho_m(X;Y) := \max_{(f(X),\,g(Y)) \in \mathcal{S}} E[f(X)g(Y)], \tag{1}$$

where $\mathcal{S}$ is the collection of pairs of real-valued random variables $f(X)$ and $g(Y)$ such that

$$E f(X) = E g(Y) = 0, \quad \text{and} \quad E f^2(X) = E g^2(Y) = 1.$$

If $\mathcal{S}$ is empty (which happens precisely when at least one of $X$ and $Y$ is constant almost surely) then one defines $\rho_m(X;Y)$ to be 0. □

This measure, first introduced by Hirschfeld [9] and Gebelein [6] and then studied by Rényi [17], has found interesting applications in information theory.

As a general remark, to stay clear of technicalities, we restrict ourselves throughout this paper to discrete random variables $(X,Y)$ taking values in $\mathcal{X} \times \mathcal{Y}$ with $|\mathcal{X}|, |\mathcal{Y}| < \infty$. Further we assume that $P(X = x) > 0$ for all $x \in \mathcal{X}$ and $P(Y = y) > 0$ for all $y \in \mathcal{Y}$. We will use $:=$ and occasionally $=:$ for equality by definition.

Definition 2: For any real-valued random variable $X$ and real number $p \ne 0$, define $\|X\|_p := (E|X|^p)^{1/p}$. Define $\|X\|_0 := \exp(E(\log|X|))$. For $p < 0$, $\|X\|_p = 0$ if $P(|X| = 0) > 0$. □

Rényi [17] derived an alternate characterization of $\rho_m(X;Y)$ as follows:

$$\rho_m(X;Y) = \max_{f(X):\, E f(X) = 0,\ E f^2(X) = 1} \|E[f(X)|Y]\|_2. \tag{2}$$

Maximal correlation has interesting connections to the hypercontractivity of Markov operators, as demonstrated by Ahlswede and Gács in [1].
Fig. 1. The blue curve is an illustration of $q^*_{X;Y}(p)$ (this curve is not convex in general). The brown line represents the 'chordal' slope $\frac{q^*_{X;Y}(p) - 1}{p - 1}$ as $p \to \infty$, which turns out to be $s^*(X;Y)$. The red line is the slope of $q^*_{X;Y}(p)$ at $(1,1)$, defined by $\lim_{p \to 1} \frac{q^*_{X;Y}(p) - 1}{p - 1}$, and turns out to be $s^*(Y;X)$. The purple line passes through $(1,1)$ and has slope $\rho_m^2(X;Y)$.
Definition 3: For $p \ge 1$ define

$$q^*_{X;Y}(p) := \inf\{q : \|E[g(Y)|X]\|_p \le \|g(Y)\|_q \ \ \forall\, g : \mathcal{Y} \to \mathbb{R}\}.$$ □

Remark 1: Ahlswede and Gács [1] characterize hypercontractivity in terms of $s_p(X,Y) := \frac{q^*_{X;Y}(p)}{p}$, for $p \ge 1$.

If $r(x)$ and $p(x)$ are probability distributions on the same finite set, we write $D(r(x)\|p(x))$ for the relative entropy distance of $r(x)$ from $p(x)$, i.e.

$$D(r(x)\|p(x)) := \sum_x r(x) \log \frac{r(x)}{p(x)}.$$

To proceed to discuss the results of this paper, we need the following definition.

Definition 4: Let $X$ and $Y$ be random variables with joint distribution $(X, Y) \sim p(x,y)$. We define

$$s^*(X;Y) := \sup_{r(x) \ne p(x)} \frac{D(r(y)\|p(y))}{D(r(x)\|p(x))}, \tag{3}$$

where $r(y)$ denotes the $y$-marginal distribution of $r(x,y) := r(x)p(y|x)$ and the supremum on the right hand side is over all probability distributions $r(x)$ that are different from the probability distribution $p(x)$. If either $X$ or $Y$ is a constant, we define $s^*(X;Y)$ to be 0. □

Remark: From the data processing inequality for relative entropies it is immediate that $s^*(X;Y) \le 1$. Further, $s^*(X;Y)$ can be regarded as a function of the input distribution $p(x)$ corresponding to a channel $p(y|x)$.

Below, we outline some of the properties of $q^*_{X;Y}(p)$, combining results from [11] and from Theorems 3 and 5 in [1].

Theorem 1: The following statements hold:

(a) For any fixed $p > 1$, $q^*_{X;Y}(p) \ge 1$, with equality if and only if $X$ and $Y$ are independent.

(b) $q^*_{X;Y}(1) = 1$ and $\frac{q^*_{X;Y}(p)}{p}$ is monotonically decreasing in $p$.

(c) $\frac{q^*_{X;Y}(p) - 1}{p - 1} \ge \rho_m^2(X;Y)$.

(d) The chordal slope of $q^*_{X;Y}(p)$ at infinity, defined by $\lim_{p \to \infty} \frac{q^*_{X;Y}(p) - 1}{p - 1}$, exists and is equal to $s^*(X;Y)$.

(e) $\lim_{p \to 1} \frac{q^*_{X;Y}(p) - 1}{p - 1} = s^*(Y;X)$.



Remark 2: Hypercontractive inequalities (and their counterpart for $p < 1$, called reverse hypercontractive inequalities) also play an important role in analysis, probability theory, and discrete Fourier analysis. Interested readers can refer to the introduction in [16] for a brief summary of their development and impact in these areas. For results and applications of hypercontractivity and reverse hypercontractivity in information theory, interested readers can refer to [11].



In this paper we will provide alternate characterizations of both $\rho_m^2(X;Y)$ and $s^*(X;Y)$. Fix a channel $p(y|x)$, fix $\lambda \in [0,1]$, and consider the function$^1$ of the probability distribution of $X$ denoted by $t_\lambda(X)$, which is defined by

$$t_\lambda(X) := H(Y) - \lambda H(X).$$

We will show in Theorem 4 that $\rho_m^2(X;Y)$ is the smallest $\lambda$ such that $t_\lambda(X)$ has a positive semidefinite Hessian at $p(x)$, and $s^*(X;Y)$ is the smallest $\lambda$ such that $t_\lambda(X)$ matches its lower convex envelope, denoted by $\mathcal{K}[t_\lambda](X)$, at $p(x)$.

In [4, Theorem 8] it was claimed that the following inequality holds:

$$I(U;Y) \le \rho_m^2(X;Y)\, I(U;X), \quad \forall\, U - X - Y.$$

It turns out that this inequality is incorrect; we will provide a counterexample in this paper. Further we will show (Theorem 4) that the following inequality holds, with a tight constant:

$$I(U;Y) \le s^*(X;Y)\, I(U;X), \quad \forall\, U - X - Y.$$

The error in the proof in [4, Theorem 8] seems to be a subtle, yet significant one. A similar error has also occurred in [10], where the authors independently rediscover the erroneous result of [4, Theorem 8] using similar techniques.

A. Alternate characterizations of the Hirschfeld-Gebelein-Rényi maximal correlation

In this section we will review some alternate characterizations of the Hirschfeld-Gebelein-Rényi maximal correlation which are known in the literature.

1) Rényi's characterization: As mentioned earlier, Rényi derived the following "one-function" alternate characterization for $\rho_m(X;Y)$ [17]:

$$\rho_m^2(X;Y) = \max_{f(X):\, E f(X) = 0,\ E f^2(X) = 1} E\big[E[f(X)|Y]^2\big]. \tag{4}$$

The validity of this characterization can be proved by fixing $f$ with $E[f(X)] = 0$ and $E[f^2(X)] = 1$ and showing that setting $g(Y) = \alpha E[f(X)|Y]$ maximizes $E[f(X)g(Y)]$ among all functions $g$ with $E[g(Y)] = 0$ and $E[g^2(Y)] = 1$, when $\alpha \ge 1$ is chosen so that $\alpha^2 E\big(E[f(X)|Y]^2\big) = 1$. This is a simple consequence of the Cauchy-Schwarz inequality.

2) Distribution simulation characterization: Consider a random variable $X'$ such that $X - Y - X'$ is Markov and $(X,Y) \overset{d}{=} (X',Y)$. Then

$$\rho_m^2(X;Y) = \max_{f(X):\, E f(X) = 0,\ E[f^2(X)] = 1} E[f(X)f(X')]. \tag{5}$$

This result follows from Rényi's characterization which was given in (4) above. Since $(X,Y) \overset{d}{=} (X',Y)$, we have $E[f(X)|Y] = E[f(X')|Y]$. Hence

$$E\big[E[f(X)|Y]^2\big] = E\big[E[f(X)|Y]\,E[f(X')|Y]\big] \overset{(a)}{=} E\big[E[f(X)f(X')|Y]\big] = E[f(X)f(X')],$$

where (a) holds because $X - Y - X'$.

3) Singular value characterization: For finite valued random variables, the maximal correlation $\rho_m(X;Y)$ can also be characterized [18] by the second largest singular value of the matrix $Q$ with entries $Q_{x,y} = \frac{p(x,y)}{\sqrt{p(x)p(y)}}$. This result can be seen by writing $E[f(X)g(Y)]$ as $\sum_{x,y} \big(f(x)\sqrt{p(x)}\big)\, Q(x,y)\, \big(g(y)\sqrt{p(y)}\big)$, observing that $\sum_x \sqrt{p(x)}\,Q(x,y) = \sqrt{p(y)}$ and $\sum_y Q(x,y)\sqrt{p(y)} = \sqrt{p(x)}$, and that the conditions $E[f(X)] = 0$ and $E[g(Y)] = 0$ are respectively equivalent to requiring that $x \mapsto f(x)\sqrt{p(x)}$ is orthogonal to $x \mapsto \sqrt{p(x)}$ and that $y \mapsto g(y)\sqrt{p(y)}$ is orthogonal to $y \mapsto \sqrt{p(y)}$.

There is a simple formula for $\rho_m(X;Y)$ if at least one of $X$ or $Y$ is binary-valued, which is most easily seen by using the singular value characterization:

$$\rho_m^2(X;Y) = \Big[\sum_{x,y} \frac{p(x,y)^2}{p(x)p(y)}\Big] - 1. \tag{6}$$



$^1$We abuse notation when we write $t_\lambda(X)$. We really wish to think of $t_\lambda$ as a function of the probability distribution of $X$.

This follows from observing that $\rho_m^2(X;Y)$ is the second largest eigenvalue of both $QQ^T$ and $Q^TQ$. If one of these is a 2 by 2 matrix, we can find the second largest eigenvalue by computing the trace and subtracting the largest eigenvalue, i.e. 1, from it.
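The trace argument behind formula (6) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `maximal_correlation_sq_binary` and the test distribution (a doubly symmetric binary source with crossover probability 0.1) are not from the paper, and the joint pmf is assumed to have strictly positive marginals.

```python
def maximal_correlation_sq_binary(p):
    """rho_m^2 via the trace formula (6); valid when X or Y is binary.

    p is a joint pmf given as a list of rows, p[x][y] = P(X=x, Y=y),
    assumed to have strictly positive marginals.
    """
    px = [sum(row) for row in p]
    py = [sum(col) for col in zip(*p)]
    if min(len(px), len(py)) != 2:
        raise ValueError("formula (6) needs a binary X or a binary Y")
    # trace(Q Q^T) = sum_{x,y} p(x,y)^2 / (p(x) p(y));
    # subtracting the largest eigenvalue (which is 1) leaves rho_m^2.
    trace = sum(p[x][y] ** 2 / (px[x] * py[y])
                for x in range(len(px)) for y in range(len(py)))
    return trace - 1.0


# Hypothetical input: doubly symmetric binary source, crossover 0.1.
dsbs = [[0.45, 0.05], [0.05, 0.45]]
rho_sq = maximal_correlation_sq_binary(dsbs)
print(rho_sq)
```

For this particular input the matrix $Q$ is [[0.9, 0.1], [0.1, 0.9]], whose singular values are 1 and 0.8, so the printed value should be close to 0.64.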

B. Properties of $\rho_m(X;Y)$

In this section, we will present some known properties of the maximal correlation $\rho_m(X;Y)$.

1) Tensorization of $\rho_m(X;Y)$: The following theorem shows that maximal correlation tensorizes. It was proved by Witsenhausen in [18]. For a function of probability distributions to have the property of the first sentence of the theorem is what it means to say that it tensorizes.

Theorem 2: (Witsenhausen [18]) If $(X_1,Y_1)$, $(X_2,Y_2)$ are independent, then

$$\rho_m(X_1,X_2;Y_1,Y_2) = \max\{\rho_m(X_1;Y_1),\, \rho_m(X_2;Y_2)\}.$$

In particular, if $(X_1,Y_1)$, $(X_2,Y_2)$ are i.i.d., then $\rho_m(X_1,X_2;Y_1,Y_2) = \rho_m(X_1;Y_1)$. □

The elegant proof in [14] (for finite valued random variables) uses the singular value characterization and is reproduced below. When $(X_1,Y_1)$ is independent of $(X_2,Y_2)$ it is immediate that the matrix $Q$ defined by

$$Q_{x_1x_2,\,y_1y_2} = \frac{p_1(x_1,y_1)\,p_2(x_2,y_2)}{\sqrt{p_1(x_1)p_2(x_2)p_1(y_1)p_2(y_2)}}$$

is the Kronecker product of the corresponding individual matrices $Q^{(1)}_{x_1,y_1} = \frac{p_1(x_1,y_1)}{\sqrt{p_1(x_1)p_1(y_1)}}$ and $Q^{(2)}_{x_2,y_2} = \frac{p_2(x_2,y_2)}{\sqrt{p_2(x_2)p_2(y_2)}}$, i.e. $Q = Q^{(1)} \otimes Q^{(2)}$. It is known that the singular values of $Q$ are given as the set of products of one singular value of $Q^{(1)}$ with one singular value of $Q^{(2)}$. Since the largest singular value of each of the three matrices is unity, it is immediate that the second largest singular value of $Q$ is $\max\{\rho_m(X_1;Y_1),\, \rho_m(X_2;Y_2)\}$.
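The tensorization statement can be checked numerically with a short standard-library sketch. It is not the proof above: it estimates the second largest singular value of $Q$ by power iteration on $QQ^T$ after projecting out the top singular vector $\sqrt{p(x)}$. The function names, the iteration count, and the two test distributions (doubly symmetric binary sources with crossover 0.1 and 0.2) are assumptions made for illustration.

```python
from math import sqrt


def maximal_correlation(p, iters=500):
    """Estimate rho_m as the second largest singular value of Q.

    p[x][y] is a joint pmf with positive marginals. Power iteration is run
    on Q Q^T restricted to the orthogonal complement of sqrt(p(x)).
    """
    px = [sum(row) for row in p]
    py = [sum(col) for col in zip(*p)]
    n, m = len(px), len(py)
    q = [[p[x][y] / sqrt(px[x] * py[y]) for y in range(m)] for x in range(n)]
    top = [sqrt(v) for v in px]               # top left singular vector of Q
    v = [float(i + 1) for i in range(n)]      # arbitrary deterministic start
    lam = 0.0
    for _ in range(iters):
        dot = sum(a * b for a, b in zip(v, top))
        v = [a - dot * b for a, b in zip(v, top)]   # project out the top vector
        norm = sqrt(sum(a * a for a in v))
        if norm < 1e-300:                     # nothing left: X and Y independent
            return 0.0
        v = [a / norm for a in v]
        w = [sum(q[x][y] * v[x] for x in range(n)) for y in range(m)]   # Q^T v
        u = [sum(q[x][y] * w[y] for y in range(m)) for x in range(n)]   # Q Q^T v
        lam = sum(a * b for a, b in zip(u, v))      # Rayleigh quotient
        v = u
    return sqrt(max(lam, 0.0))


def product(p1, p2):
    """Joint pmf of ((X1, X2), (Y1, Y2)) for independent pairs."""
    return [[p1[x1][y1] * p2[x2][y2]
             for y1 in range(len(p1[0])) for y2 in range(len(p2[0]))]
            for x1 in range(len(p1)) for x2 in range(len(p2))]


p1 = [[0.45, 0.05], [0.05, 0.45]]   # hypothetical pair 1
p2 = [[0.40, 0.10], [0.10, 0.40]]   # hypothetical pair 2
rho1, rho2 = maximal_correlation(p1), maximal_correlation(p2)
rho12 = maximal_correlation(product(p1, p2))
print(rho1, rho2, rho12)
```

Power iteration is used here only to keep the sketch free of external dependencies; a library SVD routine would serve equally well.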

Witsenhausen [18] showed that the maximal correlation of two random variables gives the answer to the following problem: consider two agents, the first of whom observes $X^n$, while the second observes $Y^n$, where $(X_i, Y_i)$, $1 \le i \le n$, are i.i.d. copies of $(X, Y)$. Each agent makes a binary decision based on the sequence available to it. The entropy of each binary decision should be bounded away from zero by a constant. Witsenhausen showed that the probability of agreement between these decisions can be made to converge to 1, as $n$ converges to infinity, if and only if $\rho_m(X;Y) = 1$. This is a version of the main result in the path-breaking work of Gács and Körner [5], which introduced the concept of Gács-Körner common information.

Erkip and Cover [4] studied the problem of investment in the stock market with side information of limited rate with the aim of quantifying the value of the side information in improving the growth rate of wealth. In one part of their much broader contribution, they present a data processing inequality which claims that

$$I(U;Y) \le \rho_m^2(X;Y)\, I(U;X), \quad \forall\, U - X - Y,$$

where $\rho_m(X;Y)$ is the Hirschfeld-Gebelein-Rényi maximal correlation between the random variables $X$ and $Y$. As we stated earlier, this inequality is incorrect.

Kang and Ulukus illustrated some applications of maximal correlation in distributed source and channel coding 
problems [12]. Beigi has introduced a quantum version of the maximal correlation for bipartite quantum states, and 
has shown that this measure fully characterizes bipartite states from which common randomness distillation under 
local operations is possible [2]. 

Recently Kamath and Anantharam [11] have used maximal correlation to study the problem of non-interactive
simulation of joint distributions. They also used hypercontractivity and reverse hypercontractivity to show that under 
certain conditions these can provide stronger impossibility results for the simulation problem than those obtained 
by maximal correlation. 

C. Alternate characterization and properties of $q^*_{X;Y}(p)$

In [11], the authors defined the following region which can be used to characterize $q^*_{X;Y}(p)$.

Definition 5: For a pair of random variables $(X, Y) \sim p(x, y)$ on $\mathcal{X} \times \mathcal{Y}$, the hypercontractivity ribbon is the subset

$$\mathcal{R}_{X;Y} \subseteq \{(p, q) \in \mathbb{R}^2 : 1 \le q \le p \ \text{or}\ 1 \ge q \ge p\}$$

defined by$^2$

• $(1,1) \in \mathcal{R}_{X;Y}$;

• For $1 \le q \le p$, $(p, q) \in \mathcal{R}_{X;Y}$ iff

$$E f(X)g(Y) \le \|f(X)\|_{p'}\, \|g(Y)\|_q \quad \forall\, f : \mathcal{X} \to \mathbb{R},\ g : \mathcal{Y} \to \mathbb{R}; \tag{7}$$

• For $1 \ge q \ge p$, $(p, q) \in \mathcal{R}_{X;Y}$ iff

$$E f(X)g(Y) \ge \|f(X)\|_{p'}\, \|g(Y)\|_q \quad \forall\, f : \mathcal{X} \to (0,\infty),\ g : \mathcal{Y} \to (0,\infty). \tag{8}$$

Here $p'$ denotes the Hölder conjugate of $p$, i.e. $\frac{1}{p} + \frac{1}{p'} = 1$. When $1 \le q \le p$, inequalities such as (7) are referred to in the literature as hypercontractive inequalities and when $1 \ge q \ge p$, inequalities such as (8) are referred to as reverse hypercontractive inequalities. □

$^2$This characterization of the hypercontractivity ribbon is given in [11]. Another characterization, which is closer to how hypercontractivity is normally discussed in the literature, will be mentioned later.

Then one can alternatively define $q^*_{X;Y}(p)$ according to

$$q^*_{X;Y}(p) := \inf\{q : (p, q) \in \mathcal{R}_{X;Y}\}, \quad p \ge 1.$$

The equivalence of this characterization to that in Definition 3 is proved in [11]. The proof is similar to that of Rényi's alternate characterization of $\rho_m(X;Y)$ and is a straightforward application of Hölder's inequality.

Likewise, we can define $\mathcal{R}_{Y;X}$. In general, $\mathcal{R}_{X;Y} \ne \mathcal{R}_{Y;X}$, but the two are related by an intimate duality relationship that is clear from (7) and (8):

$$(p,q) \in \mathcal{R}_{X;Y} \iff (q', p') \in \mathcal{R}_{Y;X}.$$

Using this duality relationship, [11] establishes that $\lim_{p \to 1} \frac{q^*_{X;Y}(p) - 1}{p - 1} = \frac{d}{dp} q^*_{X;Y}(p)\big|_{p=1} = s^*(Y;X)$.

Remark 3: In general, $s^*(X;Y) \ne s^*(Y;X)$, as shown by the following example. Let $(X, Y)$ be 0-1 valued with $P(X = 0) = 0.85$, $P(Y = 0) = 0.39$, $P(X = Y = 0) = 0.36$. Then, computation gives us $s^*(X;Y) = 0.045\ldots$, $s^*(Y;X) = 0.029\ldots$.
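The asymmetry in Remark 3 can be explored with a simple grid search over $r(x)$ in Definition 4. The sketch below is an assumption-laden illustration rather than the computation used by the authors: the function names, the grid resolution, and the cutoff that skips distributions numerically equal to $p(x)$ are all choices made here, and a grid search only yields a lower bound on the supremum.

```python
from math import log


def kl(r, p):
    """Relative entropy D(r||p) in nats, with the convention 0 log 0 = 0."""
    return sum(ri * log(ri / pi) for ri, pi in zip(r, p) if ri > 0)


def s_star_binary_input(px0, channel, grid=20000):
    """Grid estimate (a lower bound) of s*(X;Y) for a binary input X.

    px0 is P(X=0); channel[x][y] is p(y|x).
    """
    px = [px0, 1 - px0]
    ny = len(channel[0])
    py = [sum(px[x] * channel[x][y] for x in range(2)) for y in range(ny)]
    best = 0.0
    for k in range(grid + 1):
        a = k / grid
        rx = [a, 1 - a]
        dx = kl(rx, px)
        if dx < 1e-9:          # skip r(x) = p(x) and numerically degenerate points
            continue
        ry = [sum(rx[x] * channel[x][y] for x in range(2)) for y in range(ny)]
        best = max(best, kl(ry, py) / dx)
    return best


joint = [[0.36, 0.49], [0.03, 0.12]]        # joint[x][y] from Remark 3
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]
forward = [[joint[x][y] / px[x] for y in range(2)] for x in range(2)]   # p(y|x)
backward = [[joint[x][y] / py[y] for x in range(2)] for y in range(2)]  # p(x|y)

s_xy = s_star_binary_input(px[0], forward)
s_yx = s_star_binary_input(py[0], backward)
print(s_xy, s_yx)
```

The two printed estimates can be compared against the values 0.045... and 0.029... quoted in the remark.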

Most of the applications of hypercontractivity trace their roots to the following tensorization property of the hypercontractivity ribbon.

Theorem 3: ([1], [16]) If $(X_1,Y_1)$ and $(X_2,Y_2)$ are independent, then $\mathcal{R}_{(X_1,X_2);(Y_1,Y_2)} = \mathcal{R}_{X_1;Y_1} \cap \mathcal{R}_{X_2;Y_2}$. In particular, if $(X_1,Y_1)$ and $(X_2,Y_2)$ are i.i.d., then $\mathcal{R}_{(X_1,X_2);(Y_1,Y_2)} = \mathcal{R}_{X_1;Y_1}$. □

Theorem 3 can be thought of as saying that the whole hypercontractivity ribbon tensorizes, since it says that for each $(p,q)$ we have

$$1\big((p,q) \notin \mathcal{R}_{(X_1,X_2);(Y_1,Y_2)}\big) = \max\big\{1\big((p,q) \notin \mathcal{R}_{X_1;Y_1}\big),\ 1\big((p,q) \notin \mathcal{R}_{X_2;Y_2}\big)\big\}.$$

A consequence of this then is that $s^*(X;Y)$ tensorizes, i.e. for $(X_1,Y_1)$ and $(X_2,Y_2)$ independent,

$$s^*(X_1,X_2;Y_1,Y_2) = \max\{s^*(X_1;Y_1),\ s^*(X_2;Y_2)\}.$$

We will give an alternate proof of the tensorization of $s^*(X;Y)$ later using our new characterization involving the function $t_\lambda(X)$ that was introduced earlier. A direct proof of this tensorization can be obtained as follows. The direction $s^*(X_1,X_2;Y_1,Y_2) \ge \max\{s^*(X_1;Y_1),\, s^*(X_2;Y_2)\}$ is immediate; hence we only show the non-trivial direction. Note that for any $r(x_1,x_2) \ne p(x_1,x_2)$ we have

\begin{align*}
\frac{D(r(y_1,y_2)\|p(y_1,y_2))}{D(r(x_1,x_2)\|p(x_1,x_2))}
&\overset{(a)}{=} \frac{D(r(y_1)\|p(y_1)) + \sum_{y_1} r(y_1)\, D(r(y_2|y_1)\|p(y_2))}{D(r(x_1)\|p(x_1)) + \sum_{x_1} r(x_1)\, D(r(x_2|x_1)\|p(x_2))} \\
&= \frac{D(r(y_1)\|p(y_1)) + \sum_{y_1} r(y_1)\, D\big(\sum_{x_1} r(x_1|y_1) r(y_2|x_1)\,\big\|\,p(y_2)\big)}{D(r(x_1)\|p(x_1)) + \sum_{x_1} r(x_1)\, D(r(x_2|x_1)\|p(x_2))} \\
&\overset{(b)}{\le} \frac{D(r(y_1)\|p(y_1)) + \sum_{y_1}\sum_{x_1} r(x_1|y_1) r(y_1)\, D(r(y_2|x_1)\|p(y_2))}{D(r(x_1)\|p(x_1)) + \sum_{x_1} r(x_1)\, D(r(x_2|x_1)\|p(x_2))} \\
&= \frac{D(r(y_1)\|p(y_1)) + \sum_{x_1} r(x_1)\, D(r(y_2|x_1)\|p(y_2))}{D(r(x_1)\|p(x_1)) + \sum_{x_1} r(x_1)\, D(r(x_2|x_1)\|p(x_2))} \\
&\le \max\{s^*(X_1;Y_1),\ s^*(X_2;Y_2)\}.
\end{align*}

In the above, (a) uses the fact that $p(x_1,y_1,x_2,y_2) = p_1(x_1,y_1)\,p_2(x_2,y_2)$ and (b) uses the convexity of $D(p\|q)$ in $p$. The last inequality follows from the definition of $s^*(X_1;Y_1)$ and $s^*(X_2;Y_2)$, and our assumption that $r(x_1,x_2) \ne p(x_1,x_2)$, which guarantees that at least one of the terms in the denominator is non-zero. Finally, taking the supremum over all such $r(x_1,x_2)$ we obtain the non-trivial direction $s^*(X_1,X_2;Y_1,Y_2) \le \max\{s^*(X_1;Y_1),\, s^*(X_2;Y_2)\}$.
Fig. 2. An asymmetric erasure channel.

II. Main Results 

One of the main contributions of this paper is a correction to the data processing inequality claimed by Erkip and Cover in [4, Theorem 8]. We provide a counterexample to their claim and point out a location in their proof where the argument is incomplete. We then find the correct constant to get a tight data processing inequality of the type they considered.

A. Counterexample to the Erkip-Cover data processing inequality

In [4, Theorem 8], Erkip and Cover claimed that

$$I(U;Y) \le \rho_m^2(X;Y)\, I(U;X) \tag{9}$$

holds whenever $U - X - Y$ form a Markov chain. Furthermore they claimed that $\rho_m^2(X;Y)$ is the minimum such constant, i.e.

$$\sup_{U:\ U - X - Y,\ I(U;X) > 0} \frac{I(U;Y)}{I(U;X)} = \rho_m^2(X;Y). \tag{10}$$

We will first provide a counterexample to these claims and then point out where there is a gap in their argument. In a subsequent subsection we will identify $s^*(X;Y)$ as the correct constant to replace $\rho_m^2(X;Y)$ in (9) and (10).

1) Counterexample to (9) and (10): Let $X$ be a binary random variable with $p(X = 0) = \frac{1}{2}$. Define $p(x,y)$ by passing $X$ through the asymmetric erasure channel given in Fig. 2. Using Equation (6), one can verify for this pair $(X,Y)$ that $\rho_m^2(X;Y) = 0.6$.

Suppose we construct $U$ satisfying $U - X - Y$ such that $U|X{=}0 \sim \mathrm{Ber}(0.1)$ and $U|X{=}1 \sim \mathrm{Ber}(0.4)$. Then $I(U;Y) = 0.055770\ldots$ and $I(U;X) = 0.09130\ldots$, so that $\frac{I(U;Y)}{I(U;X)} = 0.6108\ldots > 0.6 = \rho_m^2(X;Y)$, and this contradicts (10).

It can be shown in a reasonably straightforward manner, using our characterization in Theorem 4, that $s^*(X;Y) = \frac{1}{2}\log_2\big(\frac{12}{5}\big) = 0.631517\ldots$ for this pair of random variables $(X, Y)$. Simulation shows that for a suitable sequence of $U_i$ with $I(U_i;X) \to 0$, we can have $\frac{I(U_i;Y)}{I(U_i;X)}$ approach $s^*(X;Y)$ for this example. A sequence of such $U_i$ is shown in the table below.



P(U=1|X=0)    P(U=1|X=1)    I(U;Y)            I(U;X)            I(U;Y)/I(U;X)
0.1           0.4           0.055770...       0.09130...        0.6108...
0.01          0.23          0.062321...       0.099958...       0.6234...
0.001         0.102         0.031038...       0.049379...       0.6285...
0.0001        0.04          0.012507...       0.019838...       0.6304...
0.00001       0.01474       0.0046418...      0.0073545...      0.6311...
0.000001      0.005232      0.0016507...      0.0026145...      0.6313...
0.0000001     0.0018146     0.00057285...     0.00090716...     0.6314...
0.00000001    0.00061973    0.000195672...    0.000309852...    0.63150...
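The first row of the table can be reproduced with a short standard-library sketch. An important caveat: Fig. 2 is not available in this text, so the channel below is an assumption. Erasure probabilities of 1/3 for $X = 0$ and 1/2 for $X = 1$, together with $P(X = 0) = 1/2$, were chosen here only because they are consistent with the quoted values $\rho_m^2 = 0.6$, $I(U;Y) = 0.055770\ldots$ and $s^* = \frac{1}{2}\log_2\frac{12}{5}$. All variable and function names are illustrative.

```python
from math import log2


def entropy(dist):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(v * log2(v) for v in dist if v > 0)


def mutual_information(joint):
    """I(A;B) in bits for a joint pmf joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    pab = [v for row in joint for v in row]
    return entropy(pa) + entropy(pb) - entropy(pab)


# ASSUMPTION (Fig. 2 is not reproduced here): outputs are {0, erasure, 1},
# X=0 is erased with probability 1/3 and X=1 with probability 1/2.
p_x = [0.5, 0.5]
channel = [[2 / 3, 1 / 3, 0.0],
           [0.0, 1 / 2, 1 / 2]]
p_u1_given_x = [0.1, 0.4]            # P(U=1 | X=x), first row of the table

u_given_x = [[1 - q, q] for q in p_u1_given_x]          # u_given_x[x][u]
joint_ux = [[p_x[x] * u_given_x[x][u] for x in range(2)] for u in range(2)]
joint_uy = [[sum(p_x[x] * u_given_x[x][u] * channel[x][y] for x in range(2))
             for y in range(3)] for u in range(2)]
joint_xy = [[p_x[x] * channel[x][y] for y in range(3)] for x in range(2)]
p_y = [sum(col) for col in zip(*joint_xy)]

# Formula (6) applies because X is binary.
rho_sq = sum(joint_xy[x][y] ** 2 / (p_x[x] * p_y[y])
             for x in range(2) for y in range(3)) - 1
i_uy = mutual_information(joint_uy)
i_ux = mutual_information(joint_ux)
ratio = i_uy / i_ux
s_star_quoted = 0.5 * log2(12 / 5)   # value of s*(X;Y) quoted in the text
print(rho_sq, i_uy, i_ux, ratio, s_star_quoted)
```

Under the assumed channel, the ratio exceeds $\rho_m^2$ while staying below the quoted value of $s^*(X;Y)$; other rows of the table can be tried by changing `p_u1_given_x`.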



The error of the Erkip-Cover proof seems to lie in their use of a Taylor's series expansion. Consider the expansion 
in the left column of page 1037 of their paper [4], where they use their equation (16) to expand around p(v). It is 
possible that p(v) is zero for some v and this causes an error as the derivative in this direction is infinity and the 
Taylor's series expansion is no longer valid. As our counterexample shows this seems to be a significant but subtle 
error that cannot be worked around. 

Some of the works that use this incorrect result of [4], such as [3] and [19], are affected by this error. A claim similar to that of [4], which appears in [10], is also false.$^3$

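B. A geometric characterization of $\rho_m^2(X;Y)$ and $s^*(X;Y)$

Given $p(x,y)$, we can treat $p(y|x)$ as a channel, and then consider the function of the input distribution $p(x)$, defined by

$$t_\lambda(X) := H(Y) - \lambda H(X),$$

where $\lambda$ is a constant in $[0, 1]$. Observe that the function is concave when $\lambda = 0$ and convex when $\lambda = 1$.$^4$

We write $\mathcal{K}[t_\lambda](X)$ for the lower convex envelope of $t_\lambda(X)$. If $\mathcal{K}[t_\lambda](X) = t_\lambda(X)$ at $p(x)$ for some $\lambda$, then note that for any $\lambda_1 \ge \lambda$

\begin{align*}
\mathcal{K}[t_{\lambda_1}](X) &= \mathcal{K}[t_\lambda - (\lambda_1 - \lambda)H](X) \\
&\ge \mathcal{K}[t_\lambda](X) - (\lambda_1 - \lambda)H(X).
\end{align*}

Here the inequality comes from $\mathcal{K}[f + g] \ge \mathcal{K}[f] + \mathcal{K}[g]$ and from the fact that $-(\lambda_1 - \lambda)H(X)$ is convex. Therefore at $p(x)$ we will have that

$$t_{\lambda_1}(X) \ge \mathcal{K}[t_{\lambda_1}](X) \ge \mathcal{K}[t_\lambda](X) - (\lambda_1 - \lambda)H(X) = t_\lambda(X) - (\lambda_1 - \lambda)H(X) = t_{\lambda_1}(X).$$

Thus we see that if $\mathcal{K}[t_\lambda](X) = t_\lambda(X)$ at $p(x)$ for some $\lambda$, then $\mathcal{K}[t_{\lambda_1}](X) = t_{\lambda_1}(X)$ at $p(x)$ for all $\lambda_1 \ge \lambda$.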

The following theorem gives a geometric interpretation of $\rho_m^2(X;Y)$ and $s^*(X;Y)$ in terms of the behaviour of the function $t_\lambda(X)$ and identifies $s^*(X;Y)$ as the correct replacement for $\rho_m^2(X;Y)$ in (9) and (10).

Theorem 4: The following statements hold:

1) $\rho_m^2(X;Y)$ is the minimum value of $\lambda$ such that the function $t_\lambda(X)$ has a positive semidefinite Hessian at $p(x)$.

2) $s^*(X;Y)$ is the minimum value of $\lambda$ such that the function $t_\lambda(X)$ touches its lower convex envelope at $p(x)$, i.e. such that $\mathcal{K}[t_\lambda](X) = t_\lambda(X)$ at $p(x)$. Furthermore,

$$\sup_{U:\ U - X - Y,\ I(U;X) > 0} \frac{I(U;Y)}{I(U;X)} = s^*(X;Y).$$

□
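One way to see why the envelope condition in part 2) is tied to the divergence ratio of Definition 4 is the following sketch. It is not the argument used in the proof of Theorem 4, and it assumes that $p(x)$ has full support, so that $t_\lambda$ is differentiable at $p(x)$; the symbol $T_p$ for the tangent hyperplane of $t_\lambda$ at $p(x)$ is introduced here only for this illustration.

```latex
% Sketch under the stated assumptions; T_p is the tangent hyperplane of t_\lambda at p(x).
% First-order expansion of entropy around p, using \sum_x (r(x) - p(x)) = 0:
\begin{align*}
H(r) - H(p) - \langle \nabla H(p),\, r - p \rangle
  &= -\sum_x r(x)\log r(x) + \sum_x r(x)\log p(x) = -D(r\|p).
\end{align*}
% Since r(y) is a linear function of r(x), the same identity applies to H(Y), so
\begin{align*}
t_\lambda(r) - T_p(r) &= \lambda\, D(r(x)\|p(x)) - D(r(y)\|p(y)).
\end{align*}
% t_\lambda touches its lower convex envelope at p(x) iff t_\lambda(r) \ge T_p(r) for all r(x),
% i.e. iff \lambda \ge D(r(y)\|p(y)) / D(r(x)\|p(x)) for every r(x) \ne p(x).
```

Read this way, the smallest admissible $\lambda$ is the supremum of the divergence ratio over $r(x) \ne p(x)$.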

Proof of 1): This follows from Rényi's characterization of the maximal correlation, given in (4) above. Take an arbitrary multiplicative perturbation of the form $p_\epsilon(x) = p(x)(1 + \epsilon f(x))$. For $p_\epsilon$ to stay a valid perturbation we need $E[f(X)] = 0$. Furthermore we can normalize $f$ by assuming that $E[f^2(X)] = 1$. The second derivative in $\epsilon$ of $H(Y) - \lambda H(X)$ at $\epsilon = 0$ is equal to

$$-E\big[E[f(X)|Y]^2\big] + \lambda E[f^2(X)] = -E\big[E[f(X)|Y]^2\big] + \lambda,$$

which is non-negative as long as $\lambda \ge E\big[E[f(X)|Y]^2\big]$. Thus the minimum value $\lambda^*$ such that the second derivative is non-negative for all local perturbations is

$$\lambda^* = \max_{f(X):\, E f(X) = 0,\ E f^2(X) = 1} E\big[E[f(X)|Y]^2\big] = \rho_m^2(X;Y),$$

where the last equality follows from Rényi's characterization of maximal correlation. ■
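For a binary $X$ there is a single perturbation direction, so part 1) can be checked with a finite-difference second derivative of $t_\lambda$ along $a = P(X = 0)$. The sketch below uses the joint distribution of Remark 3; the function names and the step size `h` are assumptions made for this illustration.

```python
from math import log


def entropy(dist):
    """Shannon entropy in nats, with the convention 0 log 0 = 0."""
    return -sum(v * log(v) for v in dist if v > 0)


joint = [[0.36, 0.49], [0.03, 0.12]]          # joint[x][y], taken from Remark 3
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]
channel = [[joint[x][y] / px[x] for y in range(2)] for x in range(2)]


def t(lam, a):
    """t_lambda as a function of a = P(X=0), with the channel held fixed."""
    rx = [a, 1 - a]
    ry = [sum(rx[x] * channel[x][y] for x in range(2)) for y in range(2)]
    return entropy(ry) - lam * entropy(rx)


def second_derivative(lam, a, h=1e-3):
    """Central finite-difference estimate of d^2 t_lambda / da^2."""
    return (t(lam, a + h) - 2 * t(lam, a) + t(lam, a - h)) / h ** 2


# rho_m^2 from formula (6), which applies because X is binary.
rho_sq = sum(joint[x][y] ** 2 / (px[x] * py[y])
             for x in range(2) for y in range(2)) - 1
a0 = px[0]
curv_at_rho = second_derivative(rho_sq, a0)
curv_below = second_derivative(0.5 * rho_sq, a0)
curv_above = second_derivative(2.0 * rho_sq, a0)
print(rho_sq, curv_below, curv_at_rho, curv_above)
```

If the sketch is right, the curvature at $p(x)$ is negative for $\lambda$ below $\rho_m^2$, approximately zero at $\lambda = \rho_m^2$, and positive above it.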

$^3$This paper studies the ratio $\frac{I(U;Y)}{I(U;X)}$ when $I(U;X)$ is very small. However, as pointed out in [4], the supremum of $\frac{I(U;Y)}{I(U;X)}$ occurs when $I(U;X) \to 0$. So the problem studied by [10] is the same as that of [4].

$^4$This convexity at $\lambda = 1$ follows from the fact that for any $U - X - Y$ we have $I(U;X) \ge I(U;Y)$, or equivalently $H(Y) - H(X) \le H(Y|U) - H(X|U)$.



Proof of 2): Consider the minimum value of $\lambda$, say $\hat{\lambda}$, such that the function $t_\lambda(X)$ touches its lower convex envelope at $p(x)$. Thus, equivalently we are looking for the minimum $\lambda$ such that for $(X,Y) \sim p(x,y)$ we have

$$H(Y) - \lambda H(X) \le H(Y|U) - \lambda H(X|U), \quad \forall\, U : U - X - Y.$$

Note that if $U$ is independent of $X$, i.e. $I(U;X) = 0$, then the above inequality is always true. Equivalently we require the minimum $\lambda$ such that

$$\lambda \ge \frac{I(U;Y)}{I(U;X)}, \quad \forall\, U : U - X - Y \text{ with } I(U;X) > 0.$$

Thus,

$$\hat{\lambda} = \sup_{U:\ U - X - Y,\ I(U;X) > 0} \frac{I(U;Y)}{I(U;X)}.$$

Remark: Since $t_\lambda(X) = \mathcal{K}[t_\lambda](X)$ at $p(x)$ implies that the Hessian of $t_\lambda(X)$ at $p(x)$ is positive semidefinite, we have

$$\sup_{U:\ U - X - Y,\ I(U;X) > 0} \frac{I(U;Y)}{I(U;X)} \ge \rho_m^2(X;Y). \tag{11}$$

It remains to show that $\hat{\lambda} = s^*(X;Y)$, or equivalently that

$$\sup_{U:\ U - X - Y,\ I(U;X) > 0} \frac{I(U;Y)}{I(U;X)} = s^*(X;Y).$$

From standard cardinality bounding arguments, it suffices to consider $|\mathcal{U}| \le |\mathcal{X}| + 1$ to determine the value of $\sup_{U:\ U - X - Y,\ I(U;X) > 0} \frac{I(U;Y)}{I(U;X)}$.$^5$

For any $|\mathcal{U}| \le |\mathcal{X}| + 1$ making $U - X - Y$ a Markov chain with $I(U;X) > 0$ and $X \sim p(x)$, denote $P(U = u) =: w_u$ and $P(X = x|U = u) =: r_u(x)$. Clearly $\sum_u w_u r_u(x) = p(x)$. Let the channel-induced distributions on $\mathcal{Y}$ corresponding to the $r_u(x)$ be denoted by $r_u(y)$ respectively. Then elementary manipulations yield

$$\frac{I(U;Y)}{I(U;X)} = \frac{\sum_{u \in \mathcal{U}:\ r_u(x) \ne p(x)} w_u D(r_u(y)\|p(y))}{\sum_{u \in \mathcal{U}:\ r_u(x) \ne p(x)} w_u D(r_u(x)\|p(x))} \le \sup_{r(x) \ne p(x)} \frac{D(r(y)\|p(y))}{D(r(x)\|p(x))},$$

where $r(y)$ denotes the channel-induced probability distribution on $\mathcal{Y}$ corresponding to the probability distribution $r(x)$ on $\mathcal{X}$.

Since the above holds for all $U$ such that $U - X - Y$ is a Markov chain and $I(U;X) > 0$, we have

$$\sup_{U:\ U - X - Y,\ I(U;X) > 0} \frac{I(U;Y)}{I(U;X)} \le \sup_{r(x) \ne p(x)} \frac{D(r(y)\|p(y))}{D(r(x)\|p(x))} = s^*(X;Y),$$

where the last equality is by definition, see (3) above.

To show the other direction, we assume that $s^*(X;Y) > 0$, else there is nothing to prove. Let $\delta \in (0, s^*(X;Y))$ be arbitrary. We also assume without loss of generality that $p(x) > 0\ \forall x \in \mathcal{X}$ and $p(y) > 0\ \forall y \in \mathcal{Y}$, since otherwise we could have simply changed the definition of $\mathcal{X}$ and $\mathcal{Y}$.

Let $\mathcal{U}_\epsilon := \{1, 2\}$. Fix a sufficiently small $\epsilon > 0$ and define $U_\epsilon$ by:

• $w_1 = \epsilon$, $r_1(x) = r^*(x)$,

• $w_2 = 1 - \epsilon$, $r_2(x) = p(x) + \frac{\epsilon}{1-\epsilon}\big(p(x) - r^*(x)\big) = \frac{1}{1-\epsilon}p(x) - \frac{\epsilon}{1-\epsilon}r^*(x)$,

where $r^*(x) \ne p(x)$ is a probability distribution satisfying $\frac{D(r^*(y)\|p(y))}{D(r^*(x)\|p(x))} > s^*(X;Y) - \delta$. For sufficiently small $\epsilon > 0$, we will have that $r_2(x)$ is a probability distribution. Note that $w_1 + w_2 = 1$ and $w_1 r_1(x) + w_2 r_2(x) = p(x)\ \forall x \in \mathcal{X}$. Clearly $I(U_\epsilon;Y) > 0$, since $I(X;Y) = 0$ would have implied that $s^*(X;Y) = 0$.

For any $0 < \lambda < s^*(X;Y) - \delta$ define the function

$$g(\epsilon) := I(U_\epsilon;Y) - \lambda I(U_\epsilon;X).$$

$^5$Indeed our proof below indicates that even a binary $U$ suffices.




Fig. 3. Plot of $p(x) \mapsto H(Y) - 0.6H(X)$ for the asymmetric erasure channel given in Fig. 2. The X-axis is $P(X = 0)$. The straight line is drawn to connect the value of the curve at $P(X = 0) = 0$ to that at $P(X = 0) = \frac{1}{2}$, to visually demonstrate that this line is not tangent to the curve at $P(X = 0) = \frac{1}{2}$.



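We have

\begin{align*}
\frac{dg(\epsilon)}{d\epsilon}
&= -\frac{d}{d\epsilon}\Big(\epsilon H(r^*(y)) + (1-\epsilon) H\Big(\frac{p(y) - \epsilon r^*(y)}{1-\epsilon}\Big)\Big) + \lambda \frac{d}{d\epsilon}\Big(\epsilon H(r^*(x)) + (1-\epsilon) H\Big(\frac{p(x) - \epsilon r^*(x)}{1-\epsilon}\Big)\Big) \\
&= -H(r^*(y)) + H\Big(\frac{p(y) - \epsilon r^*(y)}{1-\epsilon}\Big) + \lambda H(r^*(x)) - \lambda H\Big(\frac{p(x) - \epsilon r^*(x)}{1-\epsilon}\Big) \\
&\quad - \sum_y \frac{r^*(y) - p(y)}{1-\epsilon} \log \frac{p(y) - \epsilon r^*(y)}{1-\epsilon} + \lambda \sum_x \frac{r^*(x) - p(x)}{1-\epsilon} \log \frac{p(x) - \epsilon r^*(x)}{1-\epsilon}.
\end{align*}

Thus

$$\frac{dg(\epsilon)}{d\epsilon}\Big|_{\epsilon = 0} = D(r^*(y)\|p(y)) - \lambda D(r^*(x)\|p(x)) > 0,$$

where the last inequality is because $0 < \lambda < s^*(X;Y) - \delta$ and $\frac{D(r^*(y)\|p(y))}{D(r^*(x)\|p(x))} > s^*(X;Y) - \delta$. Since $g(0) = 0$, this implies that for some $\epsilon' > 0$ we have $I(U_{\epsilon'};Y) - \lambda I(U_{\epsilon'};X) > 0$, or that

$$\sup_{U:\ U - X - Y,\ I(U;X) > 0} \frac{I(U;Y)}{I(U;X)} \ge \frac{I(U_{\epsilon'};Y)}{I(U_{\epsilon'};X)} > \lambda.$$

Since the above holds for all $\lambda < s^*(X;Y) - \delta$ we have

$$\sup_{U:\ U - X - Y,\ I(U;X) > 0} \frac{I(U;Y)}{I(U;X)} \ge s^*(X;Y) - \delta.$$

Finally, since $\delta > 0$ is arbitrary, we are done. ■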
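The two-point construction $U_\epsilon$ used in the second half of the proof can be tried out numerically. The sketch below evaluates $I(U_\epsilon;Y)/I(U_\epsilon;X)$ through the identity $I(U;X) = \sum_u w_u D(r_u(x)\|p(x))$ and watches it approach the divergence ratio of $r^*$ as $\epsilon \to 0$. The joint distribution is the one from Remark 3; the choice $r^* = (0, 1)$ and all function names are assumptions for illustration, and $r^*$ is not claimed to attain the supremum.

```python
from math import log


def kl(r, p):
    """Relative entropy D(r||p) in nats, with the convention 0 log 0 = 0."""
    return sum(ri * log(ri / pi) for ri, pi in zip(r, p) if ri > 0)


def push(rx, channel):
    """Channel-induced output distribution r(y) for an input distribution r(x)."""
    return [sum(rx[x] * channel[x][y] for x in range(len(rx)))
            for y in range(len(channel[0]))]


def ratio_for_epsilon(eps, px, r_star, channel):
    """I(U_eps;Y) / I(U_eps;X) for the two-point U_eps of the proof."""
    r2 = [(p - eps * r) / (1 - eps) for p, r in zip(px, r_star)]
    if min(r2) < 0:
        raise ValueError("eps too large: r_2 is not a probability distribution")
    py = push(px, channel)
    weights, posteriors = [eps, 1 - eps], [r_star, r2]
    i_ux = sum(w * kl(r, px) for w, r in zip(weights, posteriors))
    i_uy = sum(w * kl(push(r, channel), py) for w, r in zip(weights, posteriors))
    return i_uy / i_ux


joint = [[0.36, 0.49], [0.03, 0.12]]          # joint[x][y] from Remark 3
px = [sum(row) for row in joint]
channel = [[v / sum(row) for v in row] for row in joint]
r_star = [0.0, 1.0]                           # illustrative choice of r*(x)

target = kl(push(r_star, channel), push(px, channel)) / kl(r_star, px)
ratios = [ratio_for_epsilon(e, px, r_star, channel)
          for e in (1e-1, 1e-2, 1e-3, 1e-4)]
print(target, ratios)
```

Replacing `r_star` by a distribution closer to a maximizer of the ratio in (3) would drive the limit closer to $s^*(X;Y)$.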

Remarks: 

• Note that $\rho_m(X;Y)$ is symmetric in the pair $(X, Y)$ but $s^*(X;Y)$ is not, i.e. $s^*(X;Y) \ne s^*(Y;X)$ in general. Thus, $\sup_{U:\ U - X - Y,\ I(U;X) > 0} \frac{I(U;Y)}{I(U;X)} \ne \sup_{V:\ X - Y - V,\ I(V;Y) > 0} \frac{I(V;X)}{I(V;Y)}$ in general, which is a qualitatively different phenomenon than predicted by the incorrect Erkip-Cover claim in (10) above.

• This theorem also explains the motivation for our counterexample of the previous subsection. The plot of $p(x) \mapsto H(Y) - 0.6H(X)$ for the channel $p(y|x)$ described earlier is given in Fig. 3. The second derivative of the function $p(x) \mapsto H(Y) - 0.6H(X)$ at $P(X = 0) = \frac{1}{2}$ is zero. This validates the fact that $\rho_m^2(X;Y) = 0.6$. It is clear that the lower convex envelope of the curve does not pass through the curve at $P(X = 0) = \frac{1}{2}$. The straight line in the figure connects the values of the curve at $0$ and $\frac{1}{2}$ and clearly demonstrates that the line is not a tangent to the curve. Thus it is clear in the figure that $\rho_m^2(X;Y)$ is the local-convexity condition and not the condition for being on the convex envelope.

• Thm. 8 of [1] asserts that for fixed $p(y|x)$,

$$\max_{p(x)} \rho_m^2(X;Y) = \max_{p(x)} s^*(X;Y).$$

Using the interpretation from Theorem 4 this is immediate, since having a positive semidefinite Hessian at all points in the domain implies the graph is convex. Thus, both quantities above equal the minimum value of $\lambda$ such that the function $p(x) \mapsto H(Y) - \lambda H(X)$ is convex.

• The above characterization of $s^*(X;Y)$ is also partly motivated by Körner and Marton's characterization in [13] of less noisy broadcast channels, where they show that for a broadcast channel $X \to (Y, Z)$ the following holds:

$$I(U;Y) \ge I(U;Z)\ \ \forall\, U \to X \to (Y,Z) \iff D(r(z)\|p(z)) \le D(r(y)\|p(y))\ \ \forall\, r(x), p(x),$$

where $r(y), r(z)$ are the corresponding channel-induced distributions at $Y$ and $Z$ when $X \sim r(x)$ and similarly $p(y), p(z)$ are the corresponding channel-induced distributions at $Y$ and $Z$ when $X \sim p(x)$.

C. Alternate proof for the tensorization of s*(X; Y) 

The above characterization of s*(X;Y) results in an alternate proof of its tensorization. This proof is di- 
rectly motivated by the factorization inequalities in broadcast channels, some of which can be found in iPTll . 
Take a distribution of the form $p(x_1,x_2,y_1,y_2) = p_1(x_1)p_1(y_1|x_1)p_2(x_2)p_2(y_2|x_2)$. The easy direction is that $s^*(X_1X_2;Y_1Y_2) \ge \max(s^*(X_1;Y_1),\, s^*(X_2;Y_2))$. This easily follows from the definition of $s^*(X;Y)$. Thus the non-trivial part is to show that $s^*(X_1X_2;Y_1Y_2) \le \max(s^*(X_1;Y_1),\, s^*(X_2;Y_2))$.

Let $\lambda := \max(s^*(X_1;Y_1),\, s^*(X_2;Y_2))$. With $\mathcal{K}$ denoting the lower convex envelope operator, as earlier, we have $t_\lambda(X_1) = \mathcal{K}[t_\lambda](X_1)$ at $p_1(x_1)$ and $t_\lambda(X_2) = \mathcal{K}[t_\lambda](X_2)$ at $p_2(x_2)$, where $t_\lambda(X_1)$ denotes $H(Y_1) - \lambda H(X_1)$ and $t_\lambda(X_2)$ denotes $H(Y_2) - \lambda H(X_2)$.

We need to show that $t_\lambda(X_1,X_2) = \mathcal{K}[t_\lambda](X_1,X_2)$ at $p_1(x_1)p_2(x_2)$, where $t_\lambda(X_1,X_2)$ denotes $H(Y_1,Y_2) - \lambda H(X_1,X_2)$, thought of as a function of $p(x_1,x_2)$, with the channel given by $p(y_1,y_2|x_1,x_2) = p_1(y_1|x_1)p_2(y_2|x_2)$.

Since for any $W$ satisfying the Markov chain $W - X_1X_2 - Y_1Y_2$, we have

\begin{align*}
H(Y_1,Y_2|W) - \lambda H(X_1,X_2|W) &= H(Y_1|W) - \lambda H(X_1|W) + H(Y_2|W,Y_1) - \lambda H(X_2|W,X_1) \\
&\ge H(Y_1|W) - \lambda H(X_1|W) + H(Y_2|W,Y_1,X_1) - \lambda H(X_2|W,X_1) \\
&= H(Y_1|W) - \lambda H(X_1|W) + H(Y_2|W,X_1) - \lambda H(X_2|W,X_1),
\end{align*}

we conclude that

$$\mathcal{K}[t_\lambda](X_1,X_2) \ge \mathcal{K}[t_\lambda](X_1) + \mathcal{K}[t_\lambda](X_2).$$

This inequality in fact holds for all $\lambda$ and for all $p(x_1,x_2)$, not just for the specific $\lambda$ under consideration and at $p_1(x_1)p_2(x_2)$, which is where we want to use it.

Now with $(X_1,X_2) \sim p_1(x_1)p_2(x_2)$, we also have

$$H(Y_1,Y_2) - \lambda H(X_1,X_2) = H(Y_1) - \lambda H(X_1) + H(Y_2) - \lambda H(X_2),$$

i.e. we have $t_\lambda(X_1,X_2) = t_\lambda(X_1) + t_\lambda(X_2)$ at $p_1(x_1)p_2(x_2)$. We can put together the facts so far to write

$$t_\lambda(X_1,X_2) = t_\lambda(X_1) + t_\lambda(X_2) = \mathcal{K}[t_\lambda](X_1) + \mathcal{K}[t_\lambda](X_2) \le \mathcal{K}[t_\lambda](X_1,X_2),$$

holding for the specific $\lambda$ as defined above and for $(X_1,X_2) \sim p_1(x_1)p_2(x_2)$. Since $\mathcal{K}[t_\lambda] \le t_\lambda$ always holds, equality follows. But by our characterization of $s^*(X_1X_2;Y_1Y_2)$, this implies that $s^*(X_1X_2;Y_1Y_2) \le \max\{s^*(X_1;Y_1),\, s^*(X_2;Y_2)\}$, completing the proof of the non-trivial direction.
III. Conclusion

In this paper we presented a new geometric characterization of the maximal correlation, $\rho_m(X;Y)$, of a pair of discrete random variables $(X, Y)$ taking values in finite sets. We also presented a new geometric characterization of the chordal slope of the nontrivial boundary of the hypercontractivity ribbon of $(X, Y)$ at infinity, $s^*(X;Y)$. We showed the application of these new characterizations in recovering some of the known results about these quantities in a simple way. We also made a correction to a data processing inequality claimed by Erkip and Cover [4], the error in which has had some knock-on effects in the literature. It would be interesting to find other connections between the curve $t_\lambda(X)$ that we have associated to the channel $p(y|x)$ and the entire hypercontractivity ribbon of $(X, Y)$, as we vary $p(x)$.

ACKNOWLEDGEMENTS 

S. Kamath and V. Anantharam gratefully acknowledge research support from the ARO MURI grant W911NF-08-1-0233, "Tools for the Analysis and Design of Complex Multi-Scale Networks", from the NSF grant CNS-0910702,
and from the NSF Science & Technology Center grant CCF-0939370, "Science of Information". The work of 
Chandra Nair was partially supported by the following: an area of excellence grant (Project No. AoE/E-02/08) 
and two GRF grants (Project Nos. 415810 and 415612) from the University Grants Committee of the Hong Kong 
Special Administrative Region, China. 

References 

[1] R. Ahlswede and P. Gács, "Spreading of Sets in Product Spaces and Hypercontraction of the Markov Operator,"
[2] S. Beigi, "A New Quantum Data Processing Inequality," arXiv:1210.1689.
[3] T. A. Courtade, "Outer Bounds for Multiterminal Source Coding based on Maximal Correlation," arXiv:1302.3492.
[4] E. Erkip and T. Cover, "The efficiency of investment information," IEEE Transactions on Information Theory, vol. 44, pp. 1026-1040, May 1998.
[5] P. Gács and J. Körner, "Common information is far less than mutual information," Problems of Control and Information Theory, vol. 2, no. 2, pp. 149-162, 1973.
[6] H. Gebelein, "Das statistische Problem der Korrelation als Variations- und Eigenwert-problem und sein Zusammenhang mit der Ausgleichungsrechnung," Zeitschrift für angew. Math. und Mech. 21, pp. 364-379 (1941).
[7] Y. Geng and C. Nair, "The capacity region of the two-receiver vector Gaussian broadcast channel with private and common messages," arXiv:1202.0097, Feb. 2012.
[8] A. Gohari and V. Anantharam, "Evaluation of Marton's Inner Bound for the General Broadcast Channel," IEEE Transactions on Information Theory, vol. 58, no. 2, Feb. 2012.
[9] H. O. Hirschfeld, "A connection between correlation and contingency," Proc. Cambridge Philosophical Soc. 31, pp. 520-524 (1935).
[10] S.-L. Huang and L. Zheng, "Linear Information Coupling Problems," IEEE Symposium on Information Theory (ISIT), 2012.
[11] S. Kamath and V. Anantharam, "Non-interactive Simulation of Joint Distributions: The Hirschfeld-Gebelein-Rényi Maximal Correlation and the Hypercontractivity Ribbon," Proceedings of the 50th Annual Allerton Conference on Communications, Control and Computing, 2012, Monticello, Illinois.
[12] W. Kang and S. Ulukus, "A New Data Processing Inequality and Its Applications in Distributed Source and Channel Coding," IEEE Transactions on Information Theory 57, pp. 56-69 (2011).
[13] J. Körner and K. Marton, "Comparison of two noisy channels," Topics in Inform. Theory (ed. by I. Csiszár and P. Elias), Keszthely, Hungary, August 1975, pp. 411-423.
[14] G. Kumar, "On sequences of pairs of dependent random variables: A simpler proof of the main result using SVD," on webpage, July 2010, http://www.stanford.edu/~gowthamr/research/Witsenhausen_simpleproof.pdf
[15] G. Kumar, "Binary Renyi Correlation: A simpler proof of Witsenhausen's result and a tight lower bound," on webpage, July 2010, http://www.stanford.edu/~gowthamr/research/binary_renyi_correlation.pdf
[16] E. Mossel, K. Oleszkiewicz and A. Sen, "On Reverse Hypercontractivity," Geometric and Functional Analysis, 2013, pp. 1-36.
[17] A. Rényi, "On measures of dependence," Acta Math. Hung., vol. 10, pp. 441-451, 1959.
[18] H. S. Witsenhausen, "On sequences of pairs of dependent random variables," SIAM Journal on Applied Mathematics, vol. 28, no. 1, pp. 100-113, January 1975.
[19] L. Zhao and Y.-K. Chia, "The efficiency of common randomness generation," 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2011, pp. 944-950.



